suppressPackageStartupMessages(library(hdf5r))
Warning messages:
1: In readChar(file, size, TRUE) : truncating string with embedded nuls
2: In readChar(file, size, TRUE) : truncating string with embedded nuls
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(DropletUtils))
suppressPackageStartupMessages(library(scDblFinder))
suppressPackageStartupMessages(library(tidyverse))

Define the arrays to load into memory

The input parameters should specify the location of the H5 raw (unfiltered) cellranger output for preprocessing and the name of the associated dataset, which will be carried forward into downstream analyses. This step also adds an identifier to the label for each cell to ensure they maintain the identity of their respective sample, regardless of file format. Uses a dash and underscore for string splitting downstream to produce metadata for each in terms of the condition and batch.

Arrays were produced using CellRanger 4.0.0 with default settings, using the mm10 2020-A reference (modified vM23/Ens98 annotation).

path <- params$input_path
id <- params$dataset

array <- Read10X_h5(filename = path)
colnames(array) <- paste0(id, "_", colnames(array))

Do basic first-level cell calls using emptyDrops

Uses the approach from Lun et al., 2019 to filter out clearly empty droplets using a dirichlet-multinomial model. Filtration is performed with an FDR<0.001, which results in liberal cell calling to be further filtered downstream. To avoid occassional cases where obvious cells get called as background, all barcodes with >1000 UMIs are assumed non-empty.

cat(paste("Called barcodes:", length(colnames(array))), 
    paste("Mean UMIs/called barcode:", sum(array)/length(colnames(array))),
    paste("Mean Features/called barcode:", sum(array != 0)/length(colnames(array))),
    paste("Percent reads in called barcodes:", plot_array$UMIs_in_cells), 
    paste("Knee point:", plot_array$Knee_threshold),
    paste("Inflection point:", plot_array$Inflection_threshold), sep = "\n")
Called barcodes: 73931
Mean UMIs/called barcode: 1712.39009346553
Mean Features/called barcode: 736.628532009576
Percent reads in called barcodes: 0.969417848398536
Knee point: 9213
Inflection point: 1158

Format into Seurat object, add metadata for regions and quality metrics*

Seurat objects are by far the most versatile and easiest to perform pre-processing on, so all subsequent steps will work with Seurat-formatted data matrices. This step also uses aforementioned string splitting to extract the condition, and it also determines mitochondrial and ribosomal read proportion.

count_stats("counts/cell", array@meta.data$nCount_RNA)
Min counts/cell 101
Max counts/cell 68912
Mean counts/cell 1712.39009346553
Median counts/cell 386
Std. dev. counts/cell 4136.31268011911
Std. error counts/cell 15.2124817796177
count_stats("features/cell", array@meta.data$nFeature_RNA)
Min features/cell 98
Max features/cell 8382
Mean features/cell 736.628532009576
Median features/cell 330
Std. dev. features/cell 1091.95480543401
Std. error features/cell 4.01597844903559
count_stats("mitochondrial %/cell", array@meta.data$Mito_proportion)
Min mitochondrial %/cell 0
Max mitochondrial %/cell 0.0573476702508961
Mean mitochondrial %/cell 0.00267811482820801
Median mitochondrial %/cell 0.00240384615384615
Std. dev. mitochondrial %/cell 0.00299081575586516
Std. error mitochondrial %/cell 1.09995867602016e-05
count_stats("ribosomal %/cell", array@meta.data$Ribo_proportion)
Min ribosomal %/cell 0
Max ribosomal %/cell 0.0396039603960396
Mean ribosomal %/cell 0.00517779632138434
Median ribosomal %/cell 0.00453514739229025
Std. dev. ribosomal %/cell 0.00404804826171147
Std. error ribosomal %/cell 1.48878639471051e-05

Perform complexity filtering

Remove cells with fewer than 1000 features. Some differentially filter cells and glia, but doing both at 1000 should be far less complex, and sufficient for our purposes.

require(Seurat)

# Store pre-filter population metrics for plotting
plot_array <- cbind.data.frame(array$nCount_RNA, array$nFeature_RNA, array$nFeature_RNA >= 1000)
colnames(plot_array) <- c("UMIs", "Genes", "Filtered")

# Remove all cells with fewer than 1000 UMIs
array <- subset(array, cells = colnames(array)[array$nFeature_RNA >= 1000])

# Output new QC metrics
paste("Cells passing 1000 gene complexity filter:", length(colnames(array)))
[1] "Cells passing 1000 gene complexity filter: 6151"
paste("UMIs/cell passing 1000 gene complexity filter:", mean(array$nCount_RNA))
[1] "UMIs/cell passing 1000 gene complexity filter: 7889.12225654365"
paste("Features/cell passing 1000 gene complexity filter:", mean(array$nFeature_RNA))
[1] "Features/cell passing 1000 gene complexity filter: 2808.8858722159"
paste("% mitochondrial/cell passing 1000 gene complexity filter:", mean(array$Mito_proportion))
[1] "% mitochondrial/cell passing 1000 gene complexity filter: 0.000282866245190432"
paste("% ribosomal/cell passing 1000 gene complexity filter:", mean(array$Ribo_proportion))
[1] "% ribosomal/cell passing 1000 gene complexity filter: 0.00183036037082939"
plot_ly(data = plot_array, x = ~UMIs, y = ~Genes, color = ~Filtered) %>%
  add_markers %>%
  layout(title = paste("1000 Gene Complexity Filter:", id))
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels

Perform quality filtering: mitochondrial reads

According to pipeComp (Germain et al., 2020), removing mitochondrial outliers mildly increases pipeline performance if done downstream with SCTransform. This cell filters out high outlier nuclei where their mitochondrial read proportion exceeds Q3+5xIQR.

require(Seurat)

# Store pre-filter population metrics for plotting
plot_array <- cbind.data.frame(array$nCount_RNA, array$Mito_proportion, 
                               array$Mito_proportion >= quantile(array$Mito_proportion)[4]+5*IQR(array$Mito_proportion))
colnames(plot_array) <- c("UMIs", "Mitochondrial_proportion", "Filtered")

# Remove all extreme high outliers for mitochondrial reads, defined as Q3+5*IQR, which is an EXTREMELY lenient standard 
array <- subset(array, cells = colnames(array)[array$Mito_proportion <=
                                                 quantile(array$Mito_proportion)[4]+5*IQR(array$Mito_proportion)])

paste("Cells passing mitochondrial outlier filter:", length(colnames(array)))
[1] "Cells passing mitochondrial outlier filter: 6126"
paste("UMIs/cell passing mitochondrial outlier filter:", mean(array$nCount_RNA))
[1] "UMIs/cell passing mitochondrial outlier filter: 7911.1629121776"
paste("Features/cell passing mitochondrial outlier filter:", mean(array$nFeature_RNA))
[1] "Features/cell passing mitochondrial outlier filter: 2814.70682337578"
paste("% mitochondrial/cell passing mitochondrial outlier filter:", mean(array$Mito_proportion))
[1] "% mitochondrial/cell passing mitochondrial outlier filter: 0.000266849517788897"
paste("% ribosomal/cell passing mitochondrial outlier filter:", mean(array$Ribo_proportion))
[1] "% ribosomal/cell passing mitochondrial outlier filter: 0.0018288011798363"
plot_ly(data = plot_array, x = ~UMIs, y = ~Mitochondrial_proportion, color = ~Filtered) %>%
  add_markers %>%
  layout(title = paste("Mitochondrial Outlier Filter:", id))
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels

Perform quality filtering: ribosomal reads

According to pipeComp (Germain et al., 2020), removing ribosomal outliers mildly increases pipeline performance if done downstream with SCTransform and in tandem with removing mitochondrial read outliers. This cell filters out high outlier nuclei where their ribosomal read proportion exceeds Q3+5xIQR.

require(Seurat)

# Store pre-filter population metrics for plotting
plot_array <- cbind.data.frame(array$nCount_RNA, array$Ribo_proportion, 
                               array$Ribo_proportion >= quantile(array$Mito_proportion)[4]+5*IQR(array$Ribo_proportion))
colnames(plot_array) <- c("UMIs", "Ribosomal_proportion", "Filtered")

# Remove all extreme high outliers for mitochondrial reads, defined as Q3+5*IQR, which is an EXTREMELY lenient standard 
array <- subset(array, cells = colnames(array)[array$Ribo_proportion <=
                                                 quantile(array$Ribo_proportion)[4]+5*IQR(array$Ribo_proportion)])

paste("Cells passing ribosomal outlier filter:", length(colnames(array)))
[1] "Cells passing ribosomal outlier filter: 6121"
paste("UMIs/cell passing ribosomal outlier filter:", mean(array$nCount_RNA))
[1] "UMIs/cell passing ribosomal outlier filter: 7911.61770952459"
paste("Features/cell passing ribosomal outlier filter:", mean(array$nFeature_RNA))
[1] "Features/cell passing ribosomal outlier filter: 2814.67946413985"
paste("% mitochondrial/cell passing ribosomal outlier filter:", mean(array$Mito_proportion))
[1] "% mitochondrial/cell passing ribosomal outlier filter: 0.000266931758172702"
paste("% ribosomal/cell passing ribosomal outlier filter:", mean(array$Ribo_proportion))
[1] "% ribosomal/cell passing ribosomal outlier filter: 0.00182286788770106"
plot_ly(data = plot_array, x = ~UMIs, y = ~Ribosomal_proportion, color = ~Filtered) %>%
  add_markers %>%
  layout(title = paste("Ribosomal Outlier Filter:", id))
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels

TEMPORARY UNTIL SOLO WORKS: remove doublets

require(Seurat)
require(scDblFinder)
require(tidyverse)

sce <- as.SingleCellExperiment(array) %>%
    scDblFinder
Clustering cells...
Identifying top genes per cluster...
Creating ~6121 artifical doublets...
Too many clusters - will create triplets only for the 10 largest clusters.Finding KNN...
Evaluating cell neighborhoods...
Finding threshold...

plot_array <- cbind.data.frame(array$nCount_RNA, 
                               array$nFeature_RNA,
                               sce$scDblFinder.class)
colnames(plot_array) <- c("UMIs", "Features", "Doublet_status")

array <- subset(array, cells = colnames(array)[sce$scDblFinder.class == "singlet"])

paste("Cells passing doublet filter:", length(colnames(array)))
[1] "Cells passing doublet filter: 5769"
paste("UMIs/cell passing doublet filter:", mean(array$nCount_RNA))
[1] "UMIs/cell passing doublet filter: 7585.19188767551"
paste("Features/cell passing doublet filter:", mean(array$nFeature_RNA))
[1] "Features/cell passing doublet filter: 2743.21840873635"
paste("% mitochondrial/cell passing doublet filter:", mean(array$Mito_proportion))
[1] "% mitochondrial/cell passing doublet filter: 0.000270180608499699"
paste("% ribosomal/cell passing doublet filter:", mean(array$Ribo_proportion))
[1] "% ribosomal/cell passing doublet filter: 0.00183325269026899"
plot_ly(data = plot_array, x = ~UMIs, y = ~Features, color = ~Doublet_status) %>%
  add_markers %>%
  layout(title = paste("Doublet Filter:", id))
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels
minimal value for n is 3, returning requested palette with 3 different levels

Save pre-processed matrix for use downstream

saveRDS(array, file = paste0("../cellranger_h5_allraw/", id, "_preprocessed.rds"))
---
title: "Preprocessing Notebook"
author: "James Howe"
output: html_notebook
params:
  date: "2020-08-31"
  dataset: "plCoAa-1-R1"
  input_path: "../datasets/cellranger_h5_raw/plCoAa_1_R1_RNA.h5"
  output_path: "../datasets/preprocessed_datasets"
---

```{r setup}
suppressPackageStartupMessages(library(hdf5r))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(DropletUtils))
suppressPackageStartupMessages(library(scDblFinder))
suppressPackageStartupMessages(library(tidyverse))

count_stats <- function(x, y){
  cat(paste("Min", x, min(y)),
      paste("Max", x, max(y)),
      paste("Mean", x, mean(y)),
      paste("Median", x, median(y)),
      paste("Std. dev.", x, sd(y)),
      paste("Std. error", x, sd(y)/sqrt(length(y))), 
            sep = "\n")
}
```

## Define the arrays to load into memory

The input parameters should specify the location of the H5 raw (unfiltered) cellranger output for preprocessing and the name of the associated dataset, which will be carried forward into downstream analyses. This step also adds an identifier to the label for each cell to ensure they maintain the identity of their respective sample, regardless of file format. Uses a dash and underscore for string splitting downstream to produce metadata for each in terms of the condition and batch.

Arrays were produced using CellRanger 4.0.0 with default settings, using the mm10 2020-A reference (modified vM23/Ens98 annotation). 

```{r 1-read_h5, warning = FALSE}
path <- params$input_path
id <- params$dataset

array <- Read10X_h5(filename = path)
colnames(array) <- paste0(id, "_", colnames(array))
```

## Do basic first-level cell calls using emptyDrops

Uses the approach from [Lun et al., 2019](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y) to filter out clearly empty droplets using a dirichlet-multinomial model. Filtration is performed with an FDR<0.001, which results in liberal cell calling to be further filtered downstream. To avoid occassional cases where obvious cells get called as background, all barcodes with >1000 UMIs are assumed non-empty. 

```{r 2-cell_barcode_call}
# run emptyDrops
barcode_filter <- emptyDrops(array, retain = 1000)
barcode_filter$FDR[is.na(barcode_filter$FDR)] <- 1
knee_ranks <- barcodeRanks(array)

# export metrics
plot_array <- list(cbind.data.frame(knee_ranks$rank, 
                                 knee_ranks$total, 
                                 barcode_filter$FDR),
                     sum(barcode_filter$FDR < 0.001),
                     (sum(array[,barcode_filter$FDR < 0.001]) / sum(array)),
                     knee_ranks@metadata$knee,
                     knee_ranks@metadata$inflection)
names(plot_array) <- c("Metrics", "Total", "UMIs_in_cells", "Knee_threshold", "Inflection_threshold")
colnames(plot_array$Metrics) <- c("Rank", "UMIs", "FDR")
rownames(plot_array$Metrics) <- c(knee_ranks@rownames)

# actually filter the array
array <- array[, plot_array$Metrics$FDR < 0.001]

# give first stage QC metrics
cat(paste("Called barcodes:", length(colnames(array))), 
    paste("Mean UMIs/called barcode:", sum(array)/length(colnames(array))),
    paste("Mean Features/called barcode:", sum(array != 0)/length(colnames(array))),
    paste("Percent reads in called barcodes:", plot_array$UMIs_in_cells), 
    paste("Knee point:", plot_array$Knee_threshold),
    paste("Inflection point:", plot_array$Inflection_threshold), sep = "\n")

# remove duplicates, will crash notebook and computer if retained in plot
plot_array$Metrics <- plot_array$Metrics[!duplicated(plot_array$Metrics$Rank),]
plot_array$Metrics <- plot_array$Metrics[order(plot_array$Metrics$Rank),]

# produce knee plot for output of emptyDrops
plot_ly(data = plot_array$Metrics, x = ~Rank, y = ~UMIs, color = ~FDR) %>%
  add_markers %>%
  layout(title = paste("UMI Elbow Plot:", id), 
         xaxis = list(type = "log"), yaxis = list(type = "log"))
```

## Format into Seurat object, add metadata for regions and quality metrics*

Seurat objects are by far the most versatile and easiest to perform pre-processing on, so all subsequent steps will work with Seurat-formatted data matrices. This step also uses aforementioned string splitting to extract the condition, and it also determines mitochondrial and ribosomal read proportion.

```{r}
array <- CreateSeuratObject(array)
  
# add non-nuclear read proportion metadata
Mito_proportion <- Matrix::colSums(array[grepl("^mt-|-mt-", rownames(array)),]) / array$nCount_RNA
array <- AddMetaData(array, Mito_proportion, col.name = "Mito_proportion")

Ribo_proportion <- Matrix::colSums(array[grepl("rpl|rps", rownames(array)),]) / array$nCount_RNA
array <- AddMetaData(array, Ribo_proportion, col.name = "Ribo_proportion")

# add group information by string splitting ID, removing replicate
array <- AddMetaData(array, strsplit(id, "-")[[1]][1], col.name = "region")

head(array@meta.data, 3)

count_stats("counts/cell", array@meta.data$nCount_RNA)
count_stats("features/cell", array@meta.data$nFeature_RNA)
count_stats("mitochondrial %/cell", array@meta.data$Mito_proportion)
count_stats("ribosomal %/cell", array@meta.data$Ribo_proportion)
```

*Perform complexity filtering*

Remove cells with fewer than 1000 features. Some differentially filter cells and glia, but doing both at 1000 should be far less complex, and sufficient for our purposes. 

```{r}

# Store pre-filter population metrics for plotting
plot_array <- cbind.data.frame(array$nCount_RNA, array$nFeature_RNA, array$nFeature_RNA >= 1000)
colnames(plot_array) <- c("UMIs", "Genes", "Filtered")

# Remove all cells with fewer than 1000 UMIs
array <- subset(array, cells = colnames(array)[array$nFeature_RNA >= 1000])

# Output new QC metrics
paste("Cells passing 1000 gene complexity filter:", length(colnames(array)))
paste("UMIs/cell passing 1000 gene complexity filter:", mean(array$nCount_RNA))
paste("Features/cell passing 1000 gene complexity filter:", mean(array$nFeature_RNA))
paste("% mitochondrial/cell passing 1000 gene complexity filter:", mean(array$Mito_proportion))
paste("% ribosomal/cell passing 1000 gene complexity filter:", mean(array$Ribo_proportion))

plot_ly(data = plot_array, x = ~UMIs, y = ~Genes, color = ~Filtered) %>%
  add_markers %>%
  layout(title = paste("1000 Gene Complexity Filter:", id))
```

*Perform quality filtering: mitochondrial reads*

According to pipeComp (Germain et al., 2020), removing mitochondrial outliers mildly increases pipeline performance if done downstream with SCTransform. This cell filters out high outlier nuclei where their mitochondrial read proportion exceeds Q3+5xIQR.

```{r}

# Store pre-filter population metrics for plotting
plot_array <- cbind.data.frame(array$nCount_RNA, array$Mito_proportion, 
                               array$Mito_proportion >= quantile(array$Mito_proportion)[4]+5*IQR(array$Mito_proportion))
colnames(plot_array) <- c("UMIs", "Mitochondrial_proportion", "Filtered")

# Remove all extreme high outliers for mitochondrial reads, defined as Q3+5*IQR, which is an EXTREMELY lenient standard 
array <- subset(array, cells = colnames(array)[array$Mito_proportion <=
                                                 quantile(array$Mito_proportion)[4]+5*IQR(array$Mito_proportion)])

paste("Cells passing mitochondrial outlier filter:", length(colnames(array)))
paste("UMIs/cell passing mitochondrial outlier filter:", mean(array$nCount_RNA))
paste("Features/cell passing mitochondrial outlier filter:", mean(array$nFeature_RNA))
paste("% mitochondrial/cell passing mitochondrial outlier filter:", mean(array$Mito_proportion))
paste("% ribosomal/cell passing mitochondrial outlier filter:", mean(array$Ribo_proportion))

plot_ly(data = plot_array, x = ~UMIs, y = ~Mitochondrial_proportion, color = ~Filtered) %>%
  add_markers %>%
  layout(title = paste("Mitochondrial Outlier Filter:", id))
```

*Perform quality filtering: ribosomal reads*

According to pipeComp (Germain et al., 2020), removing ribosomal outliers mildly increases pipeline performance if done downstream with SCTransform and in tandem with removing mitochondrial read outliers. This cell filters out high outlier nuclei where their ribosomal read proportion exceeds Q3+5xIQR.

```{r}

# Store pre-filter population metrics for plotting
plot_array <- cbind.data.frame(array$nCount_RNA, array$Ribo_proportion, 
                               array$Ribo_proportion >= quantile(array$Mito_proportion)[4]+5*IQR(array$Ribo_proportion))
colnames(plot_array) <- c("UMIs", "Ribosomal_proportion", "Filtered")

# Remove all extreme high outliers for mitochondrial reads, defined as Q3+5*IQR, which is an EXTREMELY lenient standard 
array <- subset(array, cells = colnames(array)[array$Ribo_proportion <=
                                                 quantile(array$Ribo_proportion)[4]+5*IQR(array$Ribo_proportion)])

paste("Cells passing ribosomal outlier filter:", length(colnames(array)))
paste("UMIs/cell passing ribosomal outlier filter:", mean(array$nCount_RNA))
paste("Features/cell passing ribosomal outlier filter:", mean(array$nFeature_RNA))
paste("% mitochondrial/cell passing ribosomal outlier filter:", mean(array$Mito_proportion))
paste("% ribosomal/cell passing ribosomal outlier filter:", mean(array$Ribo_proportion))

plot_ly(data = plot_array, x = ~UMIs, y = ~Ribosomal_proportion, color = ~Filtered) %>%
  add_markers %>%
  layout(title = paste("Ribosomal Outlier Filter:", id))
```

*TEMPORARY UNTIL SOLO WORKS: remove doublets*

```{r}
require(Seurat)


sce <- as.SingleCellExperiment(array) %>%
    scDblFinder

plot_array <- cbind.data.frame(array$nCount_RNA, 
                               array$nFeature_RNA,
                               sce$scDblFinder.class)
colnames(plot_array) <- c("UMIs", "Features", "Doublet_status")

array <- subset(array, cells = colnames(array)[sce$scDblFinder.class == "singlet"])

paste("Cells passing doublet filter:", length(colnames(array)))
paste("UMIs/cell passing doublet filter:", mean(array$nCount_RNA))
paste("Features/cell passing doublet filter:", mean(array$nFeature_RNA))
paste("% mitochondrial/cell passing doublet filter:", mean(array$Mito_proportion))
paste("% ribosomal/cell passing doublet filter:", mean(array$Ribo_proportion))

plot_ly(data = plot_array, x = ~UMIs, y = ~Features, color = ~Doublet_status) %>%
  add_markers %>%
  layout(title = paste("Doublet Filter:", id))
```

Save pre-processed matrix for use downstream

```{r}
export_path <- params$output_path
saveRDS(array, file = paste0(export_path, id, "_preprocessed.rds"))
```